### Configuration specs for running Anomaly Detection as a Watson Studio Job

This file describes the configuration settings that drive the Anomaly Detection notebook in autopilot mode. To run `Anomaly Detection (Config Based).ipynb` with external settings, use the file `ad_config_standard.json` in the `data` folder as the basis and modify the parameters to suit your dataset. The description and allowed values of each parameter are provided below.

|Parameter|Description|Allowed Values|System Default|
|---------|-----------|--------------|--------------|
data_source|Indicates if the data for training is fetched from the MAS monitor data lake or CSV files |mas_monitor_data_lake or CSV |mas_monitor_data_lake    
asset_group_label|The label for the asset group in Maximo|User's choice. It should match the value in Maximo; for example: IIOT_DEVICE_AD_EXPLNR|None, but this is mandatory
device_type|Device type used at the Monitor level|User's choice; for example MAS_IIOT_DEVICE_C1|None, but mandatory        
sensor_data_file|CSV file containing the sensor measurements, asset Id and timestamps. Should be available in Watson Studio project's data assets|User's choice. For example: sensor_train_unsupervised.csv |None, but mandatory if the parameter `data_source` is CSV        
validation_data_file|CSV file with the labeled data. Should be available in the Watson Studio project's data assets|User's choice. For example: sensor_train_labeled.csv|None, but mandatory if `data_source` is CSV and the parameter `learning_mode` is set to semi-supervised
asset_id_column_name|The column of the data frame that contains the Id of the asset|User's choice|id    
timestamp_column_name|The column of the data frame that contains the timestamp when the measurements in that particular row were recorded|User's choice|evt_timestamp      
source_variables|A **list** of column names containing the raw variables excluding the asset Id and timestamp columns. This should be provided as a list of values|Depends on the user's data|None, but mandatory. If not provided, the system will try to infer from the data frame          
data_resampling|If the training data needs to be resampled, provide the sampling window in time units|Any pandas time series offset alias (for example `15min`)|None. This parameter is optional; if it is not provided, no resampling is done
learning_mode|Choice of unsupervised vs semi-supervised approach. If semi-supervised approach is chosen, labelled data should be provided|unsupervised or semi-supervised|unsupervised    
enable_temporal_features|Flag indicating if temporal / statistical features are needed|"True" or "False" as string|False. By default no features will be created    
rolling_window_size|Rolling window size for temporal features. Not to be confused with `data_resampling` above|Any pandas time series offset alias|None. This is not a mandatory parameter. However, if `enable_temporal_features` is set to "True", a suitable value for this parameter must be provided
minimum_periods|Value for minimum periods. Refer to the pandas documentation|Pandas aliases|None. This is not a mandatory parameter
simple_aggregation_functions|Simple temporal statistics|Provide this input as a list containing one or more labels from the list [`min`, `max`, `mean`, `std`,<br /> `sum`, `count`,`median`]|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one set of aggregation functions must be provided: either this one, `higher_order_aggregation_functions` or `advanced_aggregation_functions`. All three sets can also be used together
higher_order_aggregation_functions|Higher order temporal statistical functions|Provide this input as a list containing one or more labels from the list [`sum`,`skew`,`kurt`,`quantile_25`,<br />`quantile_75`,`quantile_range`]|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one set of aggregation functions must be provided: either this one, `simple_aggregation_functions` or `advanced_aggregation_functions`. All three sets can also be used together
advanced_aggregation_functions|Defines advanced aggregation functions to be computed|Provide this input as a list containing one or more labels from the list [`rate_of_change`,`sum_of_change`,`absolute_sum_of_changes`,<br /> `trend_slop`,`abs_energy`,`mean_abs_change`,`mean_change`,<br /> `mean_second_derivate_central`,`count_above_mean`,`count_below_mean`]|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one set of aggregation functions must be provided: either this one, `simple_aggregation_functions` or `higher_order_aggregation_functions`. All three sets can also be used together
wml_deployment|Flag indicating WML deployment is needed|"True" or "False"|"True"    
wml_deployment_space_name|The name of the WML deployment space. Should be pre-created from Watson Studio|Any user specified name as per WML requirements|MAS-Testing-Deployment-Space    
wml_model_type|A string representing the type of the trained model|One of the WML aliases appropriate for this model|scikit-learn_1.3    
wml_base_software_spec_name|A string identifying the closest software specification defined in WML. The custom software spec will be defined based on this|One of the WML aliases appropriate for this model|runtime-24.1-py3.11
data_quality_analysis|Turning this on will perform data quality analysis checking for missing values. For the definition of this spec, please see the example below at the end of this table|"True" or "False" as string|True   
auto_imputation_config|Turning this on will perform imputation of missing data. For the definition of this configuration, please see the example below|If this is defined, it means auto imputation is desired. This configuration is defined inside the data quality analysis block as shown in the example below|Not defined. This means by default auto imputation will not be performed             
execution_params|Defines a group of parameters for the ML pipelines to execute. This block involves additional parameters|Example shown at the end of this page, but all the parameters have defaults if this whole block is not provided / configured|Defaults apply for each parameter in this block    
execution_type|Related to hyper parameter search|one of the strings from the set [`single_node_complete_search`, `single_node_random_search`,<br /> `spark_node_complete_search`,`spark_node_random_search`, <br />`evolutionary_search`,`hyperband_search`,<br />`rbfopt_search`,`bayesian_search`]|spark_node_random_search    
number_of_option_per_pipeline|Number of parameter settings that are sampled. This parameter applies when `execution_type` is "spark_node_random_search" or "single_node_random_search"|Integer value|10
maximum_evaluation_time_per_pipeline|Defined in minutes. Maximum timeout for the execution of pipelines with a unique parameter grid combination. This parameter applies when `execution_type` is "spark_node_random_search" or "spark_node_complete_search"|An integer value|30
total_execution_time|Defined as minutes. The time it takes for finishing the execution of all the pipelines. This needs to be larger than the `maximum_evaluation_time_per_pipeline` parameter|Integer value|70        
random_state|A seed for random number generation, used to keep the behaviour and results consistent across runs|An integer value|42
log_level|Controls the logging level|one of the values from the list ['low', 'medium', 'high']|'low'     
scaling|The type of scaler needed|one of the values from ['Normalize', 'Standardize', 'Robust']|Normalize    
normalization_range|This applies if Normalize is used as the `scaling`|Two integers representing the lower and upper values of the range (inclusive of those two values) provided as a list|[0,1]     
scoring_method|Scoring method for Unsupervised model training|One of the values from the list ['em_score', 'mv_score', 'al_score']|em_score       
threshold_criterion|Criterion for computation of the anomaly threshold|See the example below|{"std":[2.0]}     
estimator_selection_criteria|Training produces several estimators, each of which may flag a different number of anomalies; this setting drives which one is selected. 'minimum_anomalies' selects the estimator that identifies the minimum, non-zero number of anomalies in the training data. 'maximum_anomalies' does the opposite|One of the values in the list ['minimum_anomalies', 'maximum_anomalies']|minimum_anomalies
include_extended_algorithms|Setting this to "True" includes a broader, more comprehensive set of ML estimators, and training will take more time to finish|"True" or "False" as a string|"False"
include_covariance_based_techniques|Setting this to "True" includes covariance based estimators. Training will take more time to finish, and the training time increases with each additional column / feature in the training data set|"True" or "False" as a string|"False"
use_specific_estimators|This will let the user / data scientist pick specific sklearn estimators to override the defaults|One or more of the values from the list [`isolationforest`,`nearestneighboranomalymodel`,<br /> `mincovdet`,`anomalyensembler`,`nsa`,<br /> `predictonly_anomalyensembler`,`anomalyrobustpca`, <br /> `extendedisolationforest`,`lofnearestneighboranomalymodel`,<br /> `neuralnetworknsa`,`anomalypca_t2`,<br />`anomalypca_q`,`samplesvdd`,`empiricalcovariance`,<br /> `ellipticenvelope`,`ledoitwolf`,`oas`,<br />`shrunkcovariance`,`oneclasssvm`,`gaussiangraphicalmodel`,<br />`gmmoutlier`,`cusum`,`kerneldensity`,`graphpgscps`,<br /> `hotellingt2`,`spad`,`extendedspad`,`oob`,<br />`DNNAutoEncoder`,`ggm_snn`,`ggm_kl_div_dist`,<br />`ggm_kl_divergence`,`ggm_frobenius_norm`,`ggm_likelihood`,<br /> `ggm_spectral`,`ggm_mahalanobis`,`ggm_sparse_subgraph`,<br />`graphquic`, `randompartitionforest`]|[isolationforest, nearestneighboranomalymodel, <br /> lofnearestneighboranomalymodel, anomalyensembler, anomalyrobustpca,<br /> anomalypca_t2, anomalypca_q, oneclasssvm, gmmoutlier]       
exclude_specific_estimators|The user / data scientist can pick specific estimators to exclude from the standard stack. The same stack provided above can be used to create an exclusion list|The same list of estimators shown above|[]         
monitor_deployment|Setting this block of configuration will enable deployment in MAS monitor|Example is shown at the end of this table|default is shown in the example below   
model_instance_name|A user defined name for the trained model to deploy in monitor|Any user specified string|Anomaly Detection 1    
model_instance_desc|A user defined description for the trained model to deploy in monitor|Any user specified string|Anomaly Detection Using MAS Predict     
write_initial_result|Setting this to "True" will make the system write the training results to the database|"True" or "False" as string|"True"      
model_upgrade|Setting this to "True" will allow the system to upgrade / overwrite the previous model upon retraining|"True" or "False" as a string|False
enable_model|Setting this to "True" will immediately enable the model for scoring|"True" or "False" as a string|"True"      
scoring_schedule|This configuration is required to define a scoring schedule|See the example below|No defaults     

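Several parameters in the table are only conditionally mandatory. The sketch below collects those cross-parameter rules in one place so a config can be checked before the job is submitted. It is an illustration only: `check_config` is a hypothetical helper (not part of the notebook), and it assumes the parameters shown sit at the top level of the JSON file as a flat dictionary.

```python
import json

# The three aggregation lists described in the table.
AGGREGATION_KEYS = (
    "simple_aggregation_functions",
    "higher_order_aggregation_functions",
    "advanced_aggregation_functions",
)


def check_config(config):
    """Return a list of problems found in a config dict (hypothetical helper)."""
    problems = []
    # System defaults taken from the table above.
    data_source = config.get("data_source", "mas_monitor_data_lake")
    learning_mode = config.get("learning_mode", "unsupervised")

    for key in ("asset_group_label", "device_type"):
        if not config.get(key):
            problems.append(f"{key} is mandatory")

    if data_source == "CSV":
        if not config.get("sensor_data_file"):
            problems.append("sensor_data_file is mandatory when data_source is CSV")
        if learning_mode == "semi-supervised" and not config.get("validation_data_file"):
            problems.append(
                "validation_data_file is mandatory for semi-supervised learning on CSV data"
            )

    # Flags are passed as the strings "True" / "False", not JSON booleans.
    if config.get("enable_temporal_features", "False") == "True":
        if not config.get("rolling_window_size"):
            problems.append("rolling_window_size is required for temporal features")
        if not any(config.get(key) for key in AGGREGATION_KEYS):
            problems.append("at least one aggregation function list is required")

    return problems


# Example: a CSV-based, unsupervised config with temporal features enabled.
example_config = json.loads("""
{
    "data_source": "CSV",
    "asset_group_label": "IIOT_DEVICE_AD_EXPLNR",
    "device_type": "MAS_IIOT_DEVICE_C1",
    "sensor_data_file": "sensor_train_unsupervised.csv",
    "enable_temporal_features": "True",
    "rolling_window_size": "15min",
    "simple_aggregation_functions": ["mean", "std"]
}
""")
print(check_config(example_config))
```

An empty list means none of the conditional rules above were violated; it does not guarantee that the notebook accepts the file.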


+ Data Quality Analysis Config example (including the embedded Auto Imputation Config example)

```      
"data_quality_analysis": {
    "missing_value_analysis": "True",
    "missing_value_thresholds": {"var_1": 0.25, "var_2": 0.4, "var_3": 0.15, "var_4": 0.15, "var_5": 0.20, "var_6": 0.05, "var_7": 0.1, "var_8": 0.3},
    "stop_if_missing_values_exceed_threshold": "True",
    "auto_imputation_config": {"level": "default", "execution_platform": "spark_node_random_search", "use_mcar": "True"}
},
```    
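To make the role of `missing_value_thresholds` concrete, here is a minimal sketch of a per-variable missing value check. It rests on assumptions that the table does not spell out: each threshold is read as the maximum tolerated fraction of missing values for that variable, and "exceed" is read as strictly greater. The helper names are hypothetical and the notebook's actual analysis may differ.

```python
def missing_fractions(rows, variables):
    """Fraction of rows in which each variable is missing (None or absent)."""
    total = len(rows)
    fractions = {}
    for name in variables:
        missing = sum(1 for row in rows if row.get(name) is None)
        fractions[name] = missing / total if total else 0.0
    return fractions


def variables_over_threshold(rows, thresholds):
    """Return {variable: fraction} for variables whose missing fraction
    is strictly greater than the configured threshold (assumed semantics)."""
    fractions = missing_fractions(rows, thresholds)
    return {name: frac for name, frac in fractions.items() if frac > thresholds[name]}


# Toy data: var_1 is missing in 2 of 4 rows, var_2 in 1 of 4 rows.
rows = [
    {"var_1": 1.0, "var_2": 2.0},
    {"var_1": None, "var_2": 2.0},
    {"var_1": None, "var_2": None},
    {"var_1": 4.0, "var_2": 1.0},
]
thresholds = {"var_1": 0.25, "var_2": 0.4}

offenders = variables_over_threshold(rows, thresholds)
stop_flag = "True"  # mirrors stop_if_missing_values_exceed_threshold
if offenders and stop_flag == "True":
    print("Training would stop; thresholds exceeded for:", sorted(offenders))
```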

+ Execution Params example      

```      
"execution_params": {
    "execution_type": "spark_node_random_search",
    "number_of_option_per_pipeline": 1,
    "maximum_evaluation_time_per_pipeline": 30,
    "total_execution_time": 70,
    "random_state": 42,
    "log_level": 10,
    "scaling": "Normalize",
    "normalization_range": [0, 1],
    "scoring_method": "em_score",
    "threshold_criterion": {"std": [2.0]},
    "estimator_selection_criteria": "minimum_anomalies",
    "include_extended_algorithms": "False",
    "include_covariance_based_techniques": "False",
    "use_specific_estimators": ["isolationforest", "nearestneighboranomalymodel", "lofnearestneighboranomalymodel", "anomalyensembler", "anomalyrobustpca", "anomalypca_t2", "anomalypca_q", "oneclasssvm", "gmmoutlier"],
    "exclude_specific_estimators": []
},
```    
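Because every parameter in `execution_params` has a default, a partial block can be thought of as an overlay on the system defaults. The sketch below illustrates that idea together with two checks stated in the table (allowed values, and `total_execution_time` being larger than `maximum_evaluation_time_per_pipeline`). `resolve_execution_params` is a hypothetical helper, only a subset of the parameters is listed, and the way the notebook actually merges defaults is an assumption.

```python
# Subset of the system defaults listed in the parameter table.
DEFAULT_EXECUTION_PARAMS = {
    "execution_type": "spark_node_random_search",
    "number_of_option_per_pipeline": 10,
    "maximum_evaluation_time_per_pipeline": 30,
    "total_execution_time": 70,
    "random_state": 42,
    "scaling": "Normalize",
    "normalization_range": [0, 1],
    "scoring_method": "em_score",
    "threshold_criterion": {"std": [2.0]},
    "estimator_selection_criteria": "minimum_anomalies",
}

# Allowed values copied from the table.
ALLOWED_VALUES = {
    "execution_type": {
        "single_node_complete_search", "single_node_random_search",
        "spark_node_complete_search", "spark_node_random_search",
        "evolutionary_search", "hyperband_search",
        "rbfopt_search", "bayesian_search",
    },
    "scaling": {"Normalize", "Standardize", "Robust"},
    "scoring_method": {"em_score", "mv_score", "al_score"},
    "estimator_selection_criteria": {"minimum_anomalies", "maximum_anomalies"},
}


def resolve_execution_params(user_params=None):
    """Overlay user settings on the defaults and validate them (hypothetical helper)."""
    params = {**DEFAULT_EXECUTION_PARAMS, **(user_params or {})}

    for key, allowed in ALLOWED_VALUES.items():
        if params[key] not in allowed:
            raise ValueError(f"{key}={params[key]!r} is not one of {sorted(allowed)}")

    if params["total_execution_time"] <= params["maximum_evaluation_time_per_pipeline"]:
        raise ValueError(
            "total_execution_time must be larger than maximum_evaluation_time_per_pipeline"
        )

    lower, upper = params["normalization_range"]
    if lower >= upper:
        raise ValueError("normalization_range must be [lower, upper] with lower < upper")

    return params


# Only one value is overridden; everything else falls back to the defaults.
resolved = resolve_execution_params({"number_of_option_per_pipeline": 1})
print(resolved["number_of_option_per_pipeline"], resolved["execution_type"])
```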

+ Monitor Deployment example (including scoring frequency)

```     
"monitor_deployment": {
    "model_instance_name": "Anomaly Detection 1",
    "model_instance_desc": "Anomaly Detection Using MAS Predict",
    "write_initial_result": "True",
    "model_upgrade": "False",
    "enable_model": "True",
    "scoring_schedule": {"starting_at": "in_5_minutes", "every": "1D"}
}
```     
